• Fall 2022, DSPA (HS650)
  • SID: 1236
  • UMich E-mail:
  • I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.
  • Remember that students are allowed and encouraged to discuss, on a conceptual level, the problems with your class mates, however, this can not involve the exchange of actual code, printouts, solutions, e-mails or other explicit electronic or paper handouts.

Outline of problem
• Include the regular HW project cover page. Start with a one-paragraph abstract, followed by an intro/background of the problem, methods, results, discussion/conclusion, acknowledgments, and references, in that order. Clearly state the problem you have chosen to investigate. List the resources you used to come up with the project and reference all sources you used to complete the project.
• Clearly state your hypotheses prior to interrogating the data.
• Use statistical techniques from the list of techniques we have discussed in the course to convey whether or not there is statistical evidence in support of your original hypotheses.
• Explicitly state your approach to answering your research hypotheses. Write all formulas/tests/statistics you need.
• Interpret your statistical (numerical) results in plain (lay) language. Write conclusions and discussions at the end of your report and acknowledge outside help.
• Describe how this project can be extended in the future.
• One, two, or three people can work on a project as a team. If people team up, everyone must contribute equally to the project and all members must submit separate copies of the project, with their names on top (the names of all team members should be on all submissions). Expectations of team projects are higher.
• It is strongly recommended that you design your study, implement your analytic protocol, conduct the modeling, and package and report the findings using an RMD e-notebook.

1 Abstract

2 Background

In the past decade, Lake Erie has seen high concentrations of cyanobacteria, or bluegreen algae. A severity index was created to rank the algal blooms that occur each year; the highest severities occurred in 2011 and 2015, at 10 and 10.5, respectively. Not all causes of the algal blooms have been determined, but research has identified many of them. These include nutrient-rich water from wastewater treatment plants, farm fields, and fertilized lawns; invasive species; and warm, shallow water in the lake. Furthermore, scientists consider nitrogen, in the form of nitrate, and phosphorus to be the main culprits in bluegreen algae growth (Dean, 2022).

To reduce the risk of harmful algal blooms, the State of Michigan plans to focus on reducing phosphorus loads from wastewater treatment plants and agricultural sources in the River Raisin and Maumee River watersheds, and on forming collaborative partnerships to provide assistance to farmers and promote conservation practices. Local and state efforts currently focus on reducing the growth of harmful algae, but implementing new policy takes time (Dean, 2022).

To assist research, several buoys were placed in Lake Erie; they measure multiple water quality parameters and report them to research labs throughout the area. Several of these labs also provide field sampling data on physicochemical properties along with bluegreen algae concentrations. Using these data, a predictive model can be trained to predict bluegreen algae concentrations and to determine whether a given concentration is harmful to human and environmental health.

3 Methods

3.1 Data Extraction

Data were pulled into RStudio from the ERDDAP scientific database by reading HTML tables with the rvest package. This database houses water quality data provided by buoys, field sampling, and laboratory tests. Data were pulled for 2022; however, because the different data sources had to be matched in time, only observations from August through the end of October were used. Table 1 shows summary statistics for the data prior to any processing.
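A minimal sketch of this extraction step, assuming the rvest package is installed. The dataset URL is not given in this report, so an inline HTML snippet stands in for the ERDDAP page; the two rows, the column subset, and the ISO 8601 timestamp format are illustrative assumptions, not the actual data.

```r
library(rvest)

# Inline HTML stands in for the ERDDAP table page (hypothetical two-row excerpt).
# With a live page, read_html() on the dataset URL would replace minimal_html().
page <- minimal_html('
  <table>
    <tr><th>time</th><th>sea_surface_temperature</th><th>nitrate</th></tr>
    <tr><td>2022-08-04T16:20:00Z</td><td>298.0</td><td>1100</td></tr>
    <tr><td>2022-08-04T16:40:00Z</td><td>298.1</td><td>2280</td></tr>
  </table>')

# Select the first <table> node and convert it to a data frame (tibble)
wq_raw <- page %>%
  html_element("table") %>%
  html_table()

# Parse the timestamps so that records from different sources can be matched by time
wq_raw$time <- as.POSIXct(wq_raw$time, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

summary(wq_raw)
```

Numeric columns are converted automatically by `html_table()`, so only the time column needs explicit parsing in this sketch.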

The water quality parameters were chosen based on availability and significance. Among the column headers in Table 1, some important parameters are chlorophyll (mass concentration and fluorescence), dissolved oxygen (mass concentration and fractional saturation), and phycocyanin fluorescence. Chlorophyll is used by bluegreen algae to collect photosynthetically active light and therefore may be important in predicting algae concentration (Robert A. Andersen, n.d.). Dissolved oxygen is known to be depleted during periods of high algal bloom growth, which can affect the growth of aquatic plants and animals (Ting-ting Wu, 2015). Finally, phycocyanin is a non-toxic, water-soluble pigment protein from microalgae that exhibits antioxidant, anti-inflammatory, hepatoprotective, and neuroprotective effects (Morais, 2018); because cyanobacteria produce this pigment, its fluorescence is commonly used as an indicator of bluegreen algae abundance.

Table 1. Initial Water Quality Summary Statistics
…1 time longitude latitude chlorophyll_fluorescence fractional_saturation_of_oxygen_in_sea_water mass_concentration_of_blue_green_algae_in_sea_water mass_concentration_of_blue_green_algae_in_sea_water_rfu mass_concentration_of_chlorophyll_in_sea_water mass_concentration_of_oxygen_in_sea_water sea_surface_temperature sea_water_electrical_conductivity sea_water_ph_reported_on_total_scale ammonia phosphate phycocyanin_fluorescence nitrate
Min. : 1.0 Min. :2022-08-04 16:20:00 Min. :82.94 Min. :41.51 Min. :0.1400 Min. : 9.40 Min. :0.0000 Min. :0.220 Min. : 0.000 Min. :0.000790 Min. :280.9 Min. :0.02449 Min. :7.350 Min. : 0.0 Min. : 0.00 Min. : 0.1600 Min. : 0
1st Qu.:247.8 1st Qu.:2022-08-15 08:25:00 1st Qu.:82.94 1st Qu.:41.51 1st Qu.:0.5800 1st Qu.: 75.26 1st Qu.:0.3600 1st Qu.:0.670 1st Qu.: 1.650 1st Qu.:0.006287 1st Qu.:292.0 1st Qu.:0.02650 1st Qu.:7.980 1st Qu.: 35.0 1st Qu.: 48.00 1st Qu.: 0.4600 1st Qu.: 1100
Median :494.5 Median :2022-09-04 09:20:00 Median :82.94 Median :41.51 Median :0.8000 Median : 88.14 Median :0.5100 Median :0.820 Median : 2.745 Median :0.007465 Median :296.9 Median :0.02792 Median :8.135 Median : 46.0 Median : 93.00 Median : 0.7500 Median : 2280
Mean :494.5 Mean :2022-09-07 11:16:14 Mean :82.94 Mean :41.51 Mean :0.9249 Mean : 79.63 Mean :0.6323 Mean :0.949 Mean : 3.338 Mean :0.007200 Mean :294.6 Mean :0.02789 Mean :8.129 Mean : 900.8 Mean : 86.47 Mean : 0.9437 Mean : 4826
3rd Qu.:741.2 3rd Qu.:2022-09-25 03:30:00 3rd Qu.:82.94 3rd Qu.:41.51 3rd Qu.:1.1025 3rd Qu.: 93.74 3rd Qu.:0.7300 3rd Qu.:1.050 3rd Qu.: 4.188 3rd Qu.:0.008452 3rd Qu.:298.0 3rd Qu.:0.02881 3rd Qu.:8.310 3rd Qu.: 63.0 3rd Qu.: 115.00 3rd Qu.: 1.2400 3rd Qu.: 6235
Max. :988.0 Max. :2022-10-31 05:30:00 Max. :82.94 Max. :41.51 Max. :4.0100 Max. :113.60 Max. :5.1800 Max. :5.610 Max. :18.350 Max. :0.011280 Max. :300.2 Max. :0.03693 Max. :8.880 Max. :89750.0 Max. :1965.00 Max. :13.1100 Max. :80840
NA NA NA NA NA NA NA NA NA NA NA NA NA NA’s :3 NA NA’s :39 NA’s :164

3.2 Missing Data

From Table 1, three of the columns contain missing data: ammonia, phycocyanin fluorescence, and nitrate. The missingness map shows that only a small amount of data is missing, most of it in the nitrate column (164 of 988 observations). To train a predictive model, the missing data must be either removed or imputed. Because the dataset has only about 1,000 observations and the predictive model requires a large amount of training data, imputation was used rather than removing the columns containing missing data. The Amelia package was used to impute the missing data with 5 imputations. The imputed datasets were then averaged to obtain the mean imputed values. The summary below (Table 2) shows the new base statistics for the dataset.
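As a quick check on how much data is missing, the share of missing values per column can be computed from the NA counts in the last row of Table 1 and the 988 observations. This is only a sketch of the arithmetic; the data frame name `wq` in the final comment is hypothetical.

```r
# Missing-value counts read from the NA row of Table 1; n = 988 observations
n_obs <- 988
na_count <- c(ammonia = 3, phycocyanin_fluorescence = 39, nitrate = 164)

# Percentage of observations missing in each column
na_share <- round(100 * na_count / n_obs, 1)
na_share

# Nitrate accounts for most of the missing cells
round(100 * na_count["nitrate"] / sum(na_count), 1)

# On the data frame itself (hypothetical name `wq`), the counts would come from:
# colSums(is.na(wq))
```

Nitrate is missing in roughly one sixth of the rows, while ammonia and phycocyanin fluorescence are each missing in under 5% of rows.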

The Amelia package imputes data using the expectation-maximization (EM) algorithm. In the expectation step, the algorithm computes the expected value of the log-likelihood function with respect to the conditional distribution of Y (the missing data) given X (the observed data), using the parameter estimates from the previous iteration:
\[Q(\theta \mid \theta^{(t)}) = E_{Y \mid X, \theta^{(t)}}\left[\log L(\theta \mid X, Y)\right].\]
In the maximization step, this expected log-likelihood is maximized over \(\theta\), and the resulting estimate is used in the next expectation step:
\[\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}).\]
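The two steps can be illustrated with a small base-R sketch for a bivariate normal model in which one variable is fully observed and the other has missing values. This is not Amelia's implementation (Amelia combines EM with bootstrapping to generate multiple imputations); the simulated data, variable names, and tolerance are assumptions made for illustration.

```r
set.seed(1)
n <- 500
x <- rnorm(n, mean = 2, sd = 1)                  # fully observed variable
y <- 5 + 1.5 * (x - 2) + rnorm(n, sd = 0.5)      # true mean of y is 5, slope is 1.5
y_obs <- y
y_obs[sample(n, 100)] <- NA                      # remove 20% of y completely at random
miss <- is.na(y_obs)

# Starting values: observed-data means and variances, zero covariance
mu <- c(mean(x), mean(y_obs, na.rm = TRUE))
S  <- diag(c(var(x), var(y_obs, na.rm = TRUE)))

for (iter in 1:200) {
  # E-step: expected y and y^2 for the missing entries, given x and current parameters
  beta     <- S[1, 2] / S[1, 1]
  cond_var <- S[2, 2] - S[1, 2]^2 / S[1, 1]
  ey <- y_obs
  ey[miss] <- mu[2] + beta * (x[miss] - mu[1])
  ey2 <- ey^2
  ey2[miss] <- ey[miss]^2 + cond_var

  # M-step: parameters that maximize the expected complete-data log-likelihood
  mu_new <- c(mean(x), mean(ey))
  s_xy   <- mean(x * ey) - mu_new[1] * mu_new[2]
  S_new  <- matrix(c(mean(x^2) - mu_new[1]^2, s_xy,
                     s_xy, mean(ey2) - mu_new[2]^2), nrow = 2)

  delta <- max(abs(c(mu_new - mu, S_new - S)))
  mu <- mu_new
  S  <- S_new
  if (delta < 1e-8) break                        # stop once the estimates stabilize
}

mu                   # estimated means of x and y
S[1, 2] / S[1, 1]    # estimated slope of y on x
iter                 # number of iterations used
```

Each pass through the loop is one update from \(\theta^{(t)}\) to \(\theta^{(t+1)}\); the conditional means computed in the final E-step are the EM-based fills for the missing entries.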

Amelia creates copies of the dataset, each with its own imputed values. The number of copies is set by the value entered for “m”, in this case five. The five datasets can then be averaged to determine a mean and variance for the imputed values. Table 2 summarizes the imputed data: the columns that contained no missing values are essentially unchanged, while the columns with missing data differ slightly but still have similar mean values.
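The averaging step can be sketched in base R as follows. Amelia returns its completed datasets as a list of data frames; here that list is simulated with a simple stand-in imputer so the example runs without the package, and the small data frame and its values are hypothetical.

```r
set.seed(42)
# Hypothetical excerpt with missing values
wq <- data.frame(ammonia = c(35, NA, 46, 63, 50),
                 nitrate = c(1100, 2280, NA, 6235, NA))
m <- 5

# Stand-in for one imputation: fill NAs with noisy draws around the observed mean.
# This is NOT Amelia's algorithm; it only produces m different completed copies.
impute_once <- function(df) {
  for (col in names(df)) {
    miss <- is.na(df[[col]])
    df[[col]][miss] <- rnorm(sum(miss),
                             mean = mean(df[[col]], na.rm = TRUE),
                             sd   = sd(df[[col]], na.rm = TRUE))
  }
  df
}
imputations <- replicate(m, impute_once(wq), simplify = FALSE)

# Element-wise mean across the m completed datasets
mean_imputed <- Reduce(`+`, imputations) / m

# Variance of each cell across the m copies (zero where the value was observed)
arr <- simplify2array(lapply(imputations, as.matrix))
var_imputed <- apply(arr, c(1, 2), var)

mean_imputed
var_imputed
```

Observed cells are identical in every copy, so they pass through the average unchanged and have zero variance; only the originally missing cells vary between imputations.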

## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23
Table 2. Imputed Water Quality Summary
chlorophyll_flourescence_rfu oxygen_saturation_fraction bluegreen_algae_conc_ug.L bluegreen_algae_conc_rfu chlorophyll_conc_kg.m3 oxygen_conc_kg.m3 temp_K elec_cond_s.m pH ammonia_ug.L phosphate_ug.L phycocayanin_flour_rfu nitrate_ug.L
Min. :0.1400 Min. : 9.40 Min. :0.004326 Min. :0.220 Min. : 0.010 Min. :0.000790 Min. :280.9 Min. :0.02449 Min. :7.350 Min. : 1.00 Min. : 1.00 Min. : 0.0301 Min. : 10
1st Qu.:0.5800 1st Qu.: 75.26 1st Qu.:0.360000 1st Qu.:0.670 1st Qu.: 1.650 1st Qu.:0.006287 1st Qu.:292.0 1st Qu.:0.02650 1st Qu.:7.980 1st Qu.: 35.75 1st Qu.: 49.00 1st Qu.: 0.4600 1st Qu.: 1180
Median :0.8000 Median : 88.14 Median :0.510000 Median :0.820 Median : 2.745 Median :0.007465 Median :296.9 Median :0.02792 Median :8.135 Median : 47.00 Median : 93.00 Median : 0.7800 Median : 2435
Mean :0.9249 Mean : 79.63 Mean :0.632261 Mean :0.949 Mean : 3.338 Mean :0.007200 Mean :294.6 Mean :0.02789 Mean :8.129 Mean : 929.72 Mean : 86.79 Mean : 0.9574 Mean : 4711
3rd Qu.:1.1025 3rd Qu.: 93.74 3rd Qu.:0.730000 3rd Qu.:1.050 3rd Qu.: 4.188 3rd Qu.:0.008452 3rd Qu.:298.0 3rd Qu.:0.02881 3rd Qu.:8.310 3rd Qu.: 65.00 3rd Qu.: 115.00 3rd Qu.: 1.3006 3rd Qu.: 6082
Max. :4.0100 Max. :113.60 Max. :5.180000 Max. :5.610 Max. :18.350 Max. :0.011280 Max. :300.2 Max. :0.03693 Max. :8.880 Max. :89750.00 Max. :1965.00 Max. :13.1100 Max. :80840

4 Correlation

##                              chlorophyll_flourescence_rfu
## chlorophyll_flourescence_rfu                         1.00
## oxygen_saturation_fraction                           0.05
## bluegreen_algae_conc_ug.L                            0.58
## bluegreen_algae_conc_rfu                             0.58
## chlorophyll_conc_kg.m3                               1.00
## oxygen_conc_kg.m3                                   -0.08
## temp_K                                               0.27
## elec_cond_s.m                                        0.14
## pH                                                   0.25
## ammonia_ug.L                                        -0.02
## phosphate_ug.L                                      -0.02
## phycocayanin_flour_rfu                              -0.24
## nitrate_ug.L                                         0.03
##                              oxygen_saturation_fraction
## chlorophyll_flourescence_rfu                       0.05
## oxygen_saturation_fraction                         1.00
## bluegreen_algae_conc_ug.L                         -0.20
## bluegreen_algae_conc_rfu                          -0.20
## chlorophyll_conc_kg.m3                             0.05
## oxygen_conc_kg.m3                                  0.93
## temp_K                                            -0.52
## elec_cond_s.m                                     -0.29
## pH                                                 0.68
## ammonia_ug.L                                      -0.13
## phosphate_ug.L                                    -0.13
## phycocayanin_flour_rfu                             0.17
## nitrate_ug.L                                       0.25
##                              bluegreen_algae_conc_ug.L bluegreen_algae_conc_rfu
## chlorophyll_flourescence_rfu                      0.58                     0.58
## oxygen_saturation_fraction                       -0.20                    -0.20
## bluegreen_algae_conc_ug.L                         1.00                     1.00
## bluegreen_algae_conc_rfu                          1.00                     1.00
## chlorophyll_conc_kg.m3                            0.58                     0.58
## oxygen_conc_kg.m3                                -0.20                    -0.20
## temp_K                                            0.16                     0.16
## elec_cond_s.m                                     0.01                     0.01
## pH                                               -0.05                    -0.05
## ammonia_ug.L                                     -0.02                    -0.02
## phosphate_ug.L                                   -0.07                    -0.07
## phycocayanin_flour_rfu                           -0.16                    -0.16
## nitrate_ug.L                                     -0.07                    -0.07
##                              chlorophyll_conc_kg.m3 oxygen_conc_kg.m3 temp_K
## chlorophyll_flourescence_rfu                   1.00             -0.08   0.27
## oxygen_saturation_fraction                     0.05              0.93  -0.52
## bluegreen_algae_conc_ug.L                      0.58             -0.20   0.16
## bluegreen_algae_conc_rfu                       0.58             -0.20   0.16
## chlorophyll_conc_kg.m3                         1.00             -0.08   0.27
## oxygen_conc_kg.m3                             -0.08              1.00  -0.79
## temp_K                                         0.27             -0.79   1.00
## elec_cond_s.m                                  0.14             -0.48   0.65
## pH                                             0.25              0.41   0.20
## ammonia_ug.L                                  -0.02             -0.14   0.14
## phosphate_ug.L                                -0.02             -0.21   0.26
## phycocayanin_flour_rfu                        -0.24              0.25  -0.29
## nitrate_ug.L                                   0.03              0.28  -0.28
##                              elec_cond_s.m    pH ammonia_ug.L phosphate_ug.L
## chlorophyll_flourescence_rfu          0.14  0.25        -0.02          -0.02
## oxygen_saturation_fraction           -0.29  0.68        -0.13          -0.13
## bluegreen_algae_conc_ug.L             0.01 -0.05        -0.02          -0.07
## bluegreen_algae_conc_rfu              0.01 -0.05        -0.02          -0.07
## chlorophyll_conc_kg.m3                0.14  0.25        -0.02          -0.02
## oxygen_conc_kg.m3                    -0.48  0.41        -0.14          -0.21
## temp_K                                0.65  0.20         0.14           0.26
## elec_cond_s.m                         1.00  0.09         0.06           0.09
## pH                                    0.09  1.00        -0.02           0.06
## ammonia_ug.L                          0.06 -0.02         1.00           0.09
## phosphate_ug.L                        0.09  0.06         0.09           1.00
## phycocayanin_flour_rfu               -0.27 -0.04        -0.05          -0.10
## nitrate_ug.L                         -0.23 -0.01        -0.04           0.02
##                              phycocayanin_flour_rfu nitrate_ug.L
## chlorophyll_flourescence_rfu                  -0.24         0.03
## oxygen_saturation_fraction                     0.17         0.25
## bluegreen_algae_conc_ug.L                     -0.16        -0.07
## bluegreen_algae_conc_rfu                      -0.16        -0.07
## chlorophyll_conc_kg.m3                        -0.24         0.03
## oxygen_conc_kg.m3                              0.25         0.28
## temp_K                                        -0.29        -0.28
## elec_cond_s.m                                 -0.27        -0.23
## pH                                            -0.04        -0.01
## ammonia_ug.L                                  -0.05        -0.04
## phosphate_ug.L                                -0.10         0.02
## phycocayanin_flour_rfu                         1.00         0.08
## nitrate_ug.L                                   0.08         1.00
## 
## n= 988 
## 
## 
## P
##                              chlorophyll_flourescence_rfu
## chlorophyll_flourescence_rfu                             
## oxygen_saturation_fraction   0.1467                      
## bluegreen_algae_conc_ug.L    0.0000                      
## bluegreen_algae_conc_rfu     0.0000                      
## chlorophyll_conc_kg.m3       0.0000                      
## oxygen_conc_kg.m3            0.0084                      
## temp_K                       0.0000                      
## elec_cond_s.m                0.0000                      
## pH                           0.0000                      
## ammonia_ug.L                 0.5982                      
## phosphate_ug.L               0.5214                      
## phycocayanin_flour_rfu       0.0000                      
## nitrate_ug.L                 0.4073                      
##                              oxygen_saturation_fraction
## chlorophyll_flourescence_rfu 0.1467                    
## oxygen_saturation_fraction                             
## bluegreen_algae_conc_ug.L    0.0000                    
## bluegreen_algae_conc_rfu     0.0000                    
## chlorophyll_conc_kg.m3       0.1511                    
## oxygen_conc_kg.m3            0.0000                    
## temp_K                       0.0000                    
## elec_cond_s.m                0.0000                    
## pH                           0.0000                    
## ammonia_ug.L                 0.0000                    
## phosphate_ug.L               0.0000                    
## phycocayanin_flour_rfu       0.0000                    
## nitrate_ug.L                 0.0000                    
##                              bluegreen_algae_conc_ug.L bluegreen_algae_conc_rfu
## chlorophyll_flourescence_rfu 0.0000                    0.0000                  
## oxygen_saturation_fraction   0.0000                    0.0000                  
## bluegreen_algae_conc_ug.L                              0.0000                  
## bluegreen_algae_conc_rfu     0.0000                                            
## chlorophyll_conc_kg.m3       0.0000                    0.0000                  
## oxygen_conc_kg.m3            0.0000                    0.0000                  
## temp_K                       0.0000                    0.0000                  
## elec_cond_s.m                0.7104                    0.6989                  
## pH                           0.1454                    0.1379                  
## ammonia_ug.L                 0.4669                    0.4682                  
## phosphate_ug.L               0.0220                    0.0220                  
## phycocayanin_flour_rfu       0.0000                    0.0000                  
## nitrate_ug.L                 0.0365                    0.0383                  
##                              chlorophyll_conc_kg.m3 oxygen_conc_kg.m3 temp_K
## chlorophyll_flourescence_rfu 0.0000                 0.0084            0.0000
## oxygen_saturation_fraction   0.1511                 0.0000            0.0000
## bluegreen_algae_conc_ug.L    0.0000                 0.0000            0.0000
## bluegreen_algae_conc_rfu     0.0000                 0.0000            0.0000
## chlorophyll_conc_kg.m3                              0.0082            0.0000
## oxygen_conc_kg.m3            0.0082                                   0.0000
## temp_K                       0.0000                 0.0000                  
## elec_cond_s.m                0.0000                 0.0000            0.0000
## pH                           0.0000                 0.0000            0.0000
## ammonia_ug.L                 0.5974                 0.0000            0.0000
## phosphate_ug.L               0.5803                 0.0000            0.0000
## phycocayanin_flour_rfu       0.0000                 0.0000            0.0000
## nitrate_ug.L                 0.4065                 0.0000            0.0000
##                              elec_cond_s.m pH     ammonia_ug.L phosphate_ug.L
## chlorophyll_flourescence_rfu 0.0000        0.0000 0.5982       0.5214        
## oxygen_saturation_fraction   0.0000        0.0000 0.0000       0.0000        
## bluegreen_algae_conc_ug.L    0.7104        0.1454 0.4669       0.0220        
## bluegreen_algae_conc_rfu     0.6989        0.1379 0.4682       0.0220        
## chlorophyll_conc_kg.m3       0.0000        0.0000 0.5974       0.5803        
## oxygen_conc_kg.m3            0.0000        0.0000 0.0000       0.0000        
## temp_K                       0.0000        0.0000 0.0000       0.0000        
## elec_cond_s.m                              0.0042 0.0481       0.0030        
## pH                           0.0042               0.4877       0.0534        
## ammonia_ug.L                 0.0481        0.4877              0.0059        
## phosphate_ug.L               0.0030        0.0534 0.0059                     
## phycocayanin_flour_rfu       0.0000        0.1965 0.1057       0.0010        
## nitrate_ug.L                 0.0000        0.8056 0.1912       0.5003        
##                              phycocayanin_flour_rfu nitrate_ug.L
## chlorophyll_flourescence_rfu 0.0000                 0.4073      
## oxygen_saturation_fraction   0.0000                 0.0000      
## bluegreen_algae_conc_ug.L    0.0000                 0.0365      
## bluegreen_algae_conc_rfu     0.0000                 0.0383      
## chlorophyll_conc_kg.m3       0.0000                 0.4065      
## oxygen_conc_kg.m3            0.0000                 0.0000      
## temp_K                       0.0000                 0.0000      
## elec_cond_s.m                0.0000                 0.0000      
## pH                           0.1965                 0.8056      
## ammonia_ug.L                 0.1057                 0.1912      
## phosphate_ug.L               0.0010                 0.5003      
## phycocayanin_flour_rfu                              0.0147      
## nitrate_ug.L                 0.0147
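The layout of the output above (a coefficient matrix, the sample size, then a matrix of P values) appears to match what `Hmisc::rcorr` prints. A comparable pair of matrices can be sketched in base R with `cor()` and `cor.test()`; the simulated variables below are stand-ins for the water quality columns, and the use of Pearson correlation is an assumption.

```r
set.seed(7)
n <- 200
# Simulated stand-ins: oxygen concentration decreases with temperature, pH is independent
temp_K <- rnorm(n, mean = 295, sd = 4)
oxygen <- 0.03 - 0.00008 * temp_K + rnorm(n, sd = 0.0003)
pH     <- rnorm(n, mean = 8.1, sd = 0.2)
wq <- data.frame(temp_K, oxygen, pH)

# Pairwise Pearson correlation coefficients, rounded as in the output above
r_mat <- round(cor(wq, method = "pearson"), 2)

# Matching matrix of two-sided P values; the diagonal is left as NA
p <- ncol(wq)
p_mat <- matrix(NA_real_, p, p, dimnames = dimnames(r_mat))
for (i in seq_len(p - 1)) {
  for (j in (i + 1):p) {
    p_mat[i, j] <- p_mat[j, i] <- cor.test(wq[[i]], wq[[j]])$p.value
  }
}

r_mat
round(p_mat, 4)
```

Pairs with a coefficient of 1.00 in the output above (the two bluegreen algae columns, and chlorophyll fluorescence with chlorophyll concentration) carry the same information, which is worth keeping in mind during feature selection.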

5 Pairs Plot

6 Feature Selection

7 Neural Network


{r model build, message=FALSE, warning=FALSE, echo=FALSE}

    dl_model <- keras_model_sequential()
    act <- 'relu'
    opt <- 'Adam'
    loss <- 'mse'
    met <- 'mse'

    # Add layers to the model
    set.seed(55)
    dl_model %>%
      layer_dense(units = 10, activation = act, input_shape = length(dl_train_input)) %>%
      layer_dense(units = 120, activation = act) %>%
      layer_dropout(rate = 0.3) %>%
      layer_dense(units = 1, activation = 'linear')

    dl_model %>% compile(
      loss = loss,
      optimizer = opt,
      metrics = met
    )
    summary(dl_model)

    history <- dl_model %>% fit(
      dl_train_mat,
      dl_train_output,
      validation_split = 0.2,
      verbose = 0,
      epochs = 200
    )

    plot(history)

    test_results <- dl_model %>% evaluate(
      dl_test_mat,
      dl_test_output,
      verbose = 0
    )
    test_results

    test_predictions <- predict(dl_model, dl_test_mat)

    df <- data.frame(prediction = as.numeric(test_predictions), concentration = dl_test_output)
    colnames(df) <- c("prediction", "concentration")
    plot_ly() %>%
      add_markers(data = df, x = ~prediction, y = ~concentration,
                  name = "Data Scatter", type = "scatter", mode = "markers") %>%
      add_trace(x = c(0, 1), y = c(0, 1), type = "scatter", mode = "lines",
                line = list(width = 4), name = "Ideal Agreement") %>%
      layout(title = paste0('Scatterplot (Normalized) Observed vs. Predicted Values, Cor(Obs,Pred)=',
                            round(cor(df$prediction, df$concentration), 2)),
             xaxis = list(title = "NN (hidden=4) Predictions"),
             yaxis = list(title = "(Normalized) Observed"),
             legend = list(orientation = 'h'))

    cor(df$prediction, df$concentration)

##        loss         mse 
## 0.002744703 0.002744703

11 References